RCAS (RNA Centric Annotation System) is an automated system that provides dynamic annotations for custom
input files that contain transcriptomic target regions. Such transcriptomic target regions could be, for instance, peak regions detected by
CLIP-Seq analysis, which detects protein-RNA interactions, MeRIP-Seq analysis, which detects RNA modifications (collectively known as the epitranscriptome), or any
collection of target regions at the level of the transcriptome.
RCAS overlays the input target regions with the annotated protein-coding genes and calculates the Gene Ontology (GO) terms that may be enriched or
depleted in the input target regions compared to the background list of protein-coding genes. A classical Fisher's Exact Test is applied to each GO term, and the p-values
obtained are corrected for multiple testing using both the False Discovery Rate and the Family-Wise Error Rate.
Similar to the GO term enrichment analysis, RCAS also detects gene sets annotated in the Molecular Signatures
Database that are enriched or depleted in the queried target regions. Results are corrected for multiple testing according to both the False Discovery Rate and the Family-Wise Error Rate.
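The enrichment tests described above can be illustrated with a minimal Python sketch. It is not the RCAS implementation: the report does not say which correction procedures are used, so Benjamini-Hochberg (for the False Discovery Rate) and Bonferroni (for the Family-Wise Error Rate) are assumptions here, as are the two-sided form of the test, the function names, and the term counts, which are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """FDR-adjusted p-values (assumed procedure: Benjamini-Hochberg)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni(pvalues):
    """FWER-adjusted p-values (assumed procedure: Bonferroni)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Hypothetical counts per term:
# (targets with term, targets without, other genes with term, other genes without)
go_counts = {
    "GO:A": (8, 2, 1, 5),
    "GO:B": (5, 5, 3, 3),
    "GO:C": (1, 9, 4, 2),
}
terms = list(go_counts)
pvals = [fisher_exact_two_sided(*go_counts[t]) for t in terms]
fdr = benjamini_hochberg(pvals)
fwer = bonferroni(pvals)
for term, p, q, w in zip(terms, pvals, fdr, fwer):
    print(f"{term}\tp={p:.4g}\tFDR={q:.4g}\tFWER={w:.4g}")
```

A two-sided test is used in the sketch because the analysis reports both enrichment and depletion; a one-sided variant would sum only one tail of the hypergeometric distribution.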
Figure 1: The number of query regions that overlap each kind of gene feature is counted. The ‘y’ axis denotes the types of gene features included in the analysis and the ‘x’ axis denotes the percentage of query regions (out of the total number of query regions, denoted “n”) that overlap at least one genomic interval hosting the corresponding feature. Note that the percentages for different features do not add up to 100%, because some query regions overlap multiple kinds of features. Query regions that do not overlap any gene feature are classified as “intergenic”.
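The counting behind this kind of plot can be sketched as follows. This is an illustrative Python example rather than the RCAS code: the function names, the half-open coordinate convention, and the toy intervals are all assumptions.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def feature_overlap_percentages(queries, features):
    """Percentage of query regions overlapping each feature type.

    queries:  list of (chrom, start, end)
    features: dict mapping feature name -> list of (chrom, start, end)
    """
    n = len(queries)
    if n == 0:
        return {}
    counts = {name: 0 for name in features}
    counts["intergenic"] = 0
    for q_chrom, q_start, q_end in queries:
        hit_any = False
        for name, intervals in features.items():
            # A query counts once per feature type, however many intervals it hits.
            if any(c == q_chrom and overlaps(q_start, q_end, s, e)
                   for c, s, e in intervals):
                counts[name] += 1
                hit_any = True
        if not hit_any:
            counts["intergenic"] += 1
    return {name: 100.0 * k / n for name, k in counts.items()}


# Toy data (hypothetical): the second query spans an exon-intron boundary.
queries = [("chr1", 100, 150), ("chr1", 190, 260), ("chr2", 10, 20), ("chr1", 5000, 5050)]
features = {"exon": [("chr1", 90, 200)], "intron": [("chr1", 200, 400)]}
percentages = feature_overlap_percentages(queries, features)
print(percentages)
```

In the toy data the percentages sum to more than 100% because one query region is counted under both "exon" and "intron", which is the effect described in the caption.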
Figure 2: The number of query regions that overlap each kind of RNA gene is counted. The ‘y’ axis denotes the types of RNA genes included in the analysis and the ‘x’ axis denotes the percentage of query regions (out of the total number of query regions, denoted “n”) that overlap at least one genomic interval hosting the corresponding RNA gene type. Note that the percentages for different RNA genes do not add up to 100%, because some query regions overlap multiple kinds of RNA genes.
Figure 3: The number of query regions that overlap each gene type is counted. The ‘x’ axis denotes the types of genes included in the analysis and the ‘y’ axis denotes the percentage of query regions (out of the total number of query regions, denoted “n”) that overlap at least one genomic interval hosting the corresponding gene type. Query regions that do not overlap any known gene are classified as “Unknown”.
Figure 4: The number of query regions that overlap each chromosome is counted. For each chromosome, the query regions are further split into groups based on the gene features they overlap. The ‘x’ axis denotes the chromosomes included in the analysis and the ‘y’ axis denotes the frequency of overlaps.
Figure 6: The query regions are overlaid with the genomic coordinates of transcripts. Each transcript is divided into 100 bins of equal length, and for each bin the number of query regions that cover it is counted. Transcripts shorter than 100 bp are excluded. Thus, a coverage profile of the transcripts is obtained based on the distribution of the query regions. The strandedness of the transcripts is taken into account. The coverage profile is plotted in the 5’ to 3’ direction.
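The binning procedure used for this and the following coverage profiles can be sketched in Python. This is a simplified illustration under stated assumptions, not the RCAS implementation: coordinates are taken to be half-open, a query is treated as covering a bin if it overlaps at least one base of that bin, the strand of the query itself is ignored, and the function name is hypothetical.

```python
def coverage_profile(queries, regions, n_bins=100):
    """Accumulate a 5' to 3' coverage profile over equal-width bins.

    queries: list of (chrom, start, end)
    regions: list of (chrom, start, end, strand), e.g. transcripts
    """
    profile = [0] * n_bins
    for r_chrom, r_start, r_end, strand in regions:
        length = r_end - r_start
        if length < n_bins:
            continue  # regions shorter than n_bins bases are excluded
        for q_chrom, q_start, q_end in queries:
            if q_chrom != r_chrom:
                continue
            lo, hi = max(q_start, r_start), min(q_end, r_end)
            if lo >= hi:
                continue  # no overlap with this region
            # Bins containing the first and last overlapping bases.
            first = (lo - r_start) * n_bins // length
            last = (hi - 1 - r_start) * n_bins // length
            for b in range(first, last + 1):
                # Flip the index on the minus strand so bin 0 is always the 5' end.
                idx = b if strand == "+" else n_bins - 1 - b
                profile[idx] += 1
    return profile


# Toy example (hypothetical coordinates): one query at the low-coordinate end.
query = [("chr1", 0, 100)]
plus_profile = coverage_profile(query, [("chr1", 0, 1000, "+")])
minus_profile = coverage_profile(query, [("chr1", 0, 1000, "-")])
print(plus_profile[:10], minus_profile[90:])
```

The same query lands in the first ten bins of a plus-strand transcript and in the last ten bins of a minus-strand transcript, which is what plotting in the 5’ to 3’ direction implies.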
Figure 7: The query regions are overlaid with the genomic coordinates of each exon of each transcript. Each exon is divided into 100 bins of equal length, and for each bin the number of query regions that cover it is counted. Exons shorter than 100 bp are excluded. Thus, a coverage profile of the exons is obtained based on the distribution of the query regions. The strandedness of the exons is taken into account. The coverage profile is plotted in the 5’ to 3’ direction.
Figure 8: The query regions are overlaid with the genomic coordinates of each exon-intron junction of each transcript. Each junction comprises a 50 bp region of an exon and a 50 bp region of its neighboring intron. Exon-intron junctions are divided into 100 bins of equal length, and for each bin the number of query regions that cover it is counted. Exons shorter than 100 bp are excluded. Thus, a coverage profile of the exon-intron junctions is obtained based on the distribution of the query regions. The strandedness of the exons is taken into account. The coverage profile is plotted in the 5’ to 3’ direction.
Figure 9: The query regions are overlaid with the genomic coordinates of each intron of each transcript. Each intron is divided into 100 bins of equal length, and for each bin the number of query regions that cover it is counted. Introns shorter than 100 bp are excluded. Thus, a coverage profile of the introns is obtained based on the distribution of the query regions. The strandedness of the introns is taken into account. The coverage profile is plotted in the 5’ to 3’ direction.
Figure 10: The query regions are overlaid with the genomic coordinates of the promoter region of each transcript. The promoter region is defined as the region spanning from 2000 bp upstream of the transcription start site to 200 bp downstream of it. Each promoter is divided into 100 bins of equal length, and for each bin the number of query regions that cover it is counted. Thus, a coverage profile of the promoters is obtained based on the distribution of the query regions. The strandedness of the promoters is taken into account. The coverage profile is plotted in the 5’ to 3’ direction.
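Because "upstream" depends on the strand, the promoter coordinates differ for plus- and minus-strand transcripts. The following Python sketch shows one way to derive them. It assumes a 0-based transcription start site, half-open intervals, and that the 200 bp downstream stretch starts at the transcription start site itself; the function name is hypothetical and the report does not specify these conventions.

```python
def promoter_interval(tss, strand, upstream=2000, downstream=200):
    """Half-open promoter interval (start, end) around a 0-based TSS position."""
    if strand == "+":
        # Upstream lies at lower coordinates on the plus strand.
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # Upstream lies at higher coordinates on the minus strand.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Clamp at the chromosome start; the chromosome end is not checked here.
    return max(0, start), end


print(promoter_interval(5000, "+"))
print(promoter_interval(5000, "-"))
```

Under these assumptions both intervals are 2200 bp long, so each of the 100 bins spans 22 bp unless the promoter is truncated at a chromosome boundary.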
Figure 11: The query regions are overlaid with the genomic coordinates of each 5’ UTR region of each transcript. Each 5’ UTR region is divided into 100 bins of equal length, and for each bin the number of query regions that cover it is counted. Thus, a coverage profile of the 5’ UTR regions is obtained based on the distribution of the query regions. The strandedness of the 5’ UTR regions is taken into account. The coverage profile is plotted in the 5’ to 3’ direction.
Figure 12: The query regions are overlaid with the genomic coordinates of each 3’ UTR region of each transcript. Each 3’ UTR region is divided into 100 bins of equal length, and for each bin the number of query regions that cover it is counted. Thus, a coverage profile of the 3’ UTR regions is obtained based on the distribution of the query regions. The strandedness of the 3’ UTR regions is taken into account. The coverage profile is plotted in the 5’ to 3’ direction.
Figure 13: The genomic sequence covered by each query region is extracted from the FASTA file of the genome. MEME is then run to find enriched motif patterns in the query regions. The logos of the discovered motif patterns and the corresponding statistical test results are provided below.
MOTIF 1 MEME width = 8 sites = 517 llr = 2914 E-value = 4.1e-103
MOTIF 2 MEME width = 8 sites = 108 llr = 897 E-value = 7.0e-035
MOTIF 3 MEME width = 8 sites = 38 llr = 342 E-value = 1.9e+004
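Summary lines like the three above can be read programmatically, for example to rank motifs by E-value. The Python sketch below parses only the exact layout shown here; it is not an official MEME parser, and the regular expression, function name, and the 0.05 cutoff are assumptions made for illustration.

```python
import re

# Matches lines of the form shown above, e.g.
# "MOTIF 1 MEME width = 8 sites = 517 llr = 2914 E-value = 4.1e-103"
MOTIF_LINE = re.compile(
    r"MOTIF\s+(?P<id>\S+)\s+MEME"
    r"\s+width\s*=\s*(?P<width>\d+)"
    r"\s+sites\s*=\s*(?P<sites>\d+)"
    r"\s+llr\s*=\s*(?P<llr>\d+)"
    r"\s+E-value\s*=\s*(?P<evalue>\S+)"
)


def parse_motif_line(line):
    """Return a dict of the fields in one motif summary line, or None if it does not match."""
    m = MOTIF_LINE.search(line)
    if m is None:
        return None
    return {
        "id": m["id"],
        "width": int(m["width"]),
        "sites": int(m["sites"]),
        "llr": int(m["llr"]),
        "evalue": float(m["evalue"]),  # float() accepts exponents such as e-035
    }


lines = [
    "MOTIF 1 MEME width = 8 sites = 517 llr = 2914 E-value = 4.1e-103",
    "MOTIF 2 MEME width = 8 sites = 108 llr = 897 E-value = 7.0e-035",
    "MOTIF 3 MEME width = 8 sites = 38 llr = 342 E-value = 1.9e+004",
]
motifs = [parse_motif_line(line) for line in lines]

E_VALUE_CUTOFF = 0.05  # hypothetical threshold, chosen only for this example
kept = [m["id"] for m in motifs if m["evalue"] <= E_VALUE_CUTOFF]
print(kept)
```

With this cutoff, motifs 1 and 2 are retained, while motif 3, whose E-value is far above 1, is filtered out.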
Figure 14: The frequency of the top 10 discovered motifs in the transcriptome is plotted.
Figure 15: The frequency of the top 10 discovered motifs in the transcriptome is plotted with respect to different types of genes.
Figure 16: The frequency of the top 10 discovered motifs in the transcriptome is plotted with respect to different types of gene features.
RCAS is developed by Dr. Altuna Akalin (head of the Scientific Bioinformatics Platform), Dr. Dilmurat Yusuf (Bioinformatics Scientist), and Dr. Bora Uyar (Bioinformatics Scientist) at the Berlin Institute of Medical Systems Biology (BIMSB) at the Max-Delbrueck-Center for Molecular Medicine (MDC) in Berlin.
RCAS is developed as a bioinformatics service as part of the RNA Bioinformatics Center, which is one of the eight centers of the German Network for Bioinformatics Infrastructure (de.NBI).